
[Kernels][FI] Skip trtllm attention when num_kv_heads=1#30842

Merged
vllm-bot merged 2 commits into vllm-project:main from yeqcharlotte:fi_kvhead1
Dec 17, 2025

Conversation

@yeqcharlotte
Collaborator

@yeqcharlotte yeqcharlotte commented Dec 17, 2025

Purpose

We hit the following error when running a small model on Blackwell:

[WORKER]:  File "/redacted/path/executor/abstract.py", line 116, in initialize_from_config
[WORKER]:    self.collective_rpc("compile_or_warm_up_model")
[WORKER]:  File "/redacted/path/executor/uniproc_executor.py", line 75, in collective_rpc
[WORKER]:    result = run_method(self.driver_worker, method, args, kwargs)
[WORKER]:  File "/redacted/path/serial_utils.py", line 460, in run_method
[WORKER]:    return func(*args, **kwargs)
[WORKER]:  File "/redacted/path/worker/gpu_worker.py", line 444, in compile_or_warm_up_model
[WORKER]:    kernel_warmup(self)
[WORKER]:  File "/redacted/path/model_executor/warmup/kernel_warmup.py", line 68, in kernel_warmup
[WORKER]:    worker.model_runner._dummy_run(
[WORKER]:  File "/redacted/env/utils/_contextlib.py", line 116, in decorate_context
[WORKER]:    return func(*args, **kwargs)
[WORKER]:  File "/redacted/path/worker/gpu_model_runner.py", line 4130, in _dummy_run
[WORKER]:    outputs = self.model(
[WORKER]:  File "/redacted/path/compilation/cuda_graph.py", line 220, in __call__
[WORKER]:    return self.runnable(*args, **kwargs)
[WORKER]:  File "/redacted/env/nn/modules/module.py", line 1767, in _wrapped_call_impl
[WORKER]:    return self._call_impl(*args, **kwargs)
[WORKER]:  File "/redacted/env/nn/modules/module.py", line 1778, in _call_impl
[WORKER]:    return forward_call(*args, **kwargs)
[WORKER]:  File "/redacted/path/model.py", line 664, in forward
[WORKER]:    transformer_output, _ = self.model_core(
[WORKER]:  File "/redacted/env/nn/modules/module.py", line 1767, in _wrapped_call_impl
[WORKER]:    return self._call_impl(*args, **kwargs)
[WORKER]:  File "/redacted/env/nn/modules/module.py", line 1778, in _call_impl
[WORKER]:    return forward_call(*args, **kwargs)
[WORKER]:  File "/redacted/path/model/transformer.py", line 1804, in forward
[WORKER]:    h, cache = self.transformer_layers_forward(
[WORKER]:  File "/redacted/path/model/transformer.py", line 1930, in transformer_layers_forward
[WORKER]:    h, new_cache = layer_fn(
[WORKER]:  File "/redacted/env/nn/modules/module.py", line 1767, in _wrapped_call_impl
[WORKER]:    return self._call_impl(*args, **kwargs)
[WORKER]:  File "/redacted/env/nn/modules/module.py", line 1778, in _call_impl
[WORKER]:    return forward_call(*args, **kwargs)
[WORKER]:  File "/redacted/path/model/transformer.py", line 1103, in forward
[WORKER]:    residual_stream, new_cache = self.pre_feed_forward_processing(
[WORKER]:  File "/redacted/path/model/transformer.py", line 1045, in pre_feed_forward_processing
[WORKER]:    attn_out, new_cache = self.attention(  # Pre-norm applied inside attention
[WORKER]:  File "/redacted/env/nn/modules/module.py", line 1767, in _wrapped_call_impl
[WORKER]:    return self._call_impl(*args, **kwargs)
[WORKER]:  File "/redacted/env/nn/modules/module.py", line 1778, in _call_impl
[WORKER]:    return forward_call(*args, **kwargs)
[WORKER]:  File "/redacted/path/model/transformer.py", line 572, in forward
[WORKER]:    return self._attention_forward(
[WORKER]:  File "/redacted/path/model/transformer.py", line 704, in _attention_forward
[WORKER]:    output = self.attention(
[WORKER]:  File "/redacted/env/nn/modules/module.py", line 1767, in _wrapped_call_impl
[WORKER]:    return self._call_impl(*args, **kwargs)
[WORKER]:  File "/redacted/env/nn/modules/module.py", line 1778, in _call_impl
[WORKER]:    return forward_call(*args, **kwargs)
[WORKER]:  File "/redacted/path/model/custom_op.py", line 493, in forward
[WORKER]:    return method(*args, **kwargs)
[WORKER]:  File "/redacted/path/model/layers/paged_attention.py", line 268, in forward_native
[WORKER]:    output_vllm = self._attention.forward(query, key, value)
[WORKER]:  File "/redacted/path/attention/layer.py", line 367, in forward
[WORKER]:    torch.ops.vllm.unified_attention_with_output(
[WORKER]:  File "/redacted/env/_ops.py", line 1208, in __call__
[WORKER]:    return self._op(*args, **(kwargs or {}))
[WORKER]:  File "/redacted/path/attention/utils/kv_transfer_utils.py", line 39, in wrapper
[WORKER]:    return func(*args, **kwargs)
[WORKER]:  File "/redacted/path/attention/layer.py", line 869, in unified_attention_with_output
[WORKER]:    self.impl.forward(
[WORKER]:  File "/redacted/path/attention/backends/flashinfer.py", line 1295, in forward
[WORKER]:    trtllm_batch_context_with_kv_cache(
[WORKER]:  File "/redacted/env/flashinfer/prefill.py", line 3513, in trtllm_batch_context_with_kv_cache
[WORKER]:    run_func(
[WORKER]:  File "python/tvm_ffi/cython/function.pxi", line 923, in tvm_ffi.core.Function.__call__
[WORKER]:RuntimeError: Error in function 'buildNdTmaDescriptor' at /workspace/include/flashinfer/trtllm/fmha/kernelParams.h:536: Check failed: false

Detailed errors

Error: Failed to initialize the TMA descriptor due to invalid argument
tmaFormat: 9 dim: 4 gmem: 0xf64d740000
Shape: 64 16 1 7721128 7149852580857340533
Stride: 128 2 4096 337
tileShapes: 64 16 1 1 1024
tileStrides: 1 1 1 1 1969767282
swizzleType: 3

We confirmed that the dummy run fails when num_kv_heads=1, which produces degenerate strides (effectively a stride of 0 along the singleton head dimension). In that case we can fall back to FlashInfer's native attention instead of the TRTLLM kernel. Existing test cases skip this configuration anyway.
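To illustrate why a singleton KV-head dimension degenerates the strides, here is a minimal sketch (the NHD page layout and shapes here are hypothetical, not vLLM's actual cache layout) computing row-major strides for a paged KV cache:

```python
def row_major_strides(shape):
    """Element strides for a contiguous row-major tensor of the given shape."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

# Hypothetical NHD page layout: (block_size, num_kv_heads, head_dim).
healthy = row_major_strides((16, 8, 128))     # [1024, 128, 1]
degenerate = row_major_strides((16, 1, 128))  # [128, 128, 1]

# With num_kv_heads=1, two dimensions collapse onto the same stride (128),
# so the TMA descriptor built from these strides is no longer well-formed.
assert healthy[0] != healthy[1]
assert degenerate[0] == degenerate[1]
```

A singleton dimension carries no addressing information, which is why the descriptor builder rejects it rather than the kernel silently misreading memory.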

Test Plan

 pytest -v tests/kernels/attention/test_flashinfer_trtllm_attention.py::test_trtllm_attention_rejects_num_kv_heads_1

Test Result

tests/kernels/attention/test_flashinfer_trtllm_attention.py::test_trtllm_attention_rejects_num_kv_heads_1 PASSED [100%]

========================================= warnings summary ==========================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================== 1 passed, 2 warnings in 18.72s ===================================

Our internal run succeeded with this change. The rest will depend on CI.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a runtime error that occurs when using TRTLLM attention with num_kv_heads=1. The fix involves adding checks to disable this configuration and fall back to FlashInfer's native attention, which is the correct approach. A corresponding test case has been added to ensure this behavior is enforced. The changes are logical and well-implemented. I have one suggestion to improve the clarity and maintainability of the logic in can_use_trtllm_attention.

Comment on lines +308 to +319
    # num_kv_heads=1 is not supported due to TMA descriptor building limitations.
    # When num_kv_heads=1, the KV cache strides become degenerate (stride_heads ==
    # stride_batch), which causes CUDA's cuTensorMapEncodeTiled to fail because
    # TMA descriptors cannot handle degenerate 4D tensors with singleton dimensions.
    # See: https://fburl.com/352mrydz
    if has_trtllm and num_kv_heads == 1:
        logger.warning_once(
            "TRTLLM attention does not support num_kv_heads=1. "
            "This configuration causes TMA descriptor building to fail due to "
            "degenerate tensor strides. Falling back to FlashInfer attention."
        )
    return has_trtllm and (num_qo_heads % num_kv_heads == 0) and (num_kv_heads != 1)
Contributor


Severity: high

The logic to handle num_kv_heads=1 is correct, but its implementation could be simplified for better readability and maintainability. The current structure separates the warning log from the return logic, making it slightly convoluted. By using an early return for the num_kv_heads == 1 case, we can make the function's control flow more direct and easier to follow.

    # num_kv_heads=1 is not supported due to TMA descriptor building limitations.
    # When num_kv_heads=1, the KV cache strides become degenerate (stride_heads ==
    # stride_batch), which causes CUDA's cuTensorMapEncodeTiled to fail because
    # TMA descriptors cannot handle degenerate 4D tensors with singleton dimensions.
    # See: https://fburl.com/352mrydz
    if num_kv_heads == 1:
        if has_trtllm:
            logger.warning_once(
                "TRTLLM attention does not support num_kv_heads=1. "
                "This configuration causes TMA descriptor building to fail due to "
                "degenerate tensor strides. Falling back to FlashInfer attention."
            )
        return False

    return has_trtllm and (num_qo_heads % num_kv_heads == 0)
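The suggested early-return shape can be shown as a self-contained sketch (a hypothetical standalone function for illustration; vLLM's actual `can_use_trtllm_attention` takes more parameters and uses the project's logger):

```python
import warnings


def can_use_trtllm_attention(num_qo_heads: int, num_kv_heads: int,
                             has_trtllm: bool = True) -> bool:
    """Illustrative sketch: reject head configurations TRTLLM can't handle."""
    if num_kv_heads == 1:
        # MQA produces degenerate KV-cache strides, which breaks TMA
        # descriptor construction; fall back to FlashInfer's native path.
        if has_trtllm:
            warnings.warn(
                "TRTLLM attention does not support num_kv_heads=1; "
                "falling back to FlashInfer attention."
            )
        return False
    return has_trtllm and (num_qo_heads % num_kv_heads == 0)


assert can_use_trtllm_attention(32, 8) is True
assert can_use_trtllm_attention(32, 1) is False   # the case this PR guards
assert can_use_trtllm_attention(30, 8) is False   # qo heads not divisible
```

The early return keeps the warning and the decision in one place, so the final return expression no longer needs the extra `num_kv_heads != 1` clause.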

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Dec 17, 2025
@houseroad
Collaborator

Or, wondering if we can support num_kv_heads=1 in the FlashInfer trtllm kernel. cc: @yzh119

@yeqcharlotte yeqcharlotte enabled auto-merge (squash) December 17, 2025 07:45
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 17, 2025
@yeqcharlotte
Collaborator Author

We ran into the problem on a smaller debug-model run; a normal-sized model probably wouldn't hit these issues. cc: @pavanimajety @mgoin if the change makes sense to you.

@vllm-bot vllm-bot merged commit a100152 into vllm-project:main Dec 17, 2025
49 of 52 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Dec 17, 2025
@nvpohanh
Contributor

@yeqcharlotte Could you file a FlashInfer GitHub issue https://github.com/flashinfer-ai/flashinfer with repro steps so that we can investigate this issue? Our expectation is that the trtllm attention kernel should support num_kv_heads=1 (namely, MQA). We have tested various MQA tests and it worked for us, so we want to fix this.

@pavanimajety
Collaborator

@yeqcharlotte Did you see any log for the failed TMADescriptor? From this line in Flashinfer -
https://github.com/flashinfer-ai/flashinfer/blob/main/include/flashinfer/trtllm/fmha/kernelParams.h#L528-L543

@yeqcharlotte
Collaborator Author

@yeqcharlotte Did you see any log for the failed TMADescriptor? From this line in Flashinfer - https://github.com/flashinfer-ai/flashinfer/blob/main/include/flashinfer/trtllm/fmha/kernelParams.h#L528-L543

@pavanimajety included it in the summary section:

Error: Failed to initialize the TMA descriptor due to invalid argument
tmaFormat: 9 dim: 4 gmem: 0xf64d740000
Shape: 64 16 1 7721128 7149852580857340533
Stride: 128 2 4096 337
tileShapes: 64 16 1 1 1024
tileStrides: 1 1 1 1 1969767282
swizzleType: 3

NickLucche pushed a commit to NickLucche/vllm that referenced this pull request Dec 17, 2025
@yzh119

yzh119 commented Dec 17, 2025

Looks like a misconfiguration of the TMA descriptor; the stride 1969767282 looks suspicious to me.

@nvpohanh
Contributor

@yeqcharlotte This PR broke GPT-OSS TP8 functionally (see #30919). Is it possible to revert this PR for now and apply the patch locally in your dev branch until the FlashInfer team finds the root cause, so that we can apply a narrower check? The num_kv_heads=1 check is too broad and would affect many models. Among them, GPT-OSS is a very popular model and we don't want users to have a bad experience.

cc @mgoin

Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025
…#30842)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
shyeh25 added a commit to shyeh25/vllm that referenced this pull request Jan 2, 2026
…-project#30842)"

This reverts commit a100152.

This PR causes a functional issue for GPT-OSS-120B TP8 (NotImplementedError: FlashInfer backend currently does not support attention sinks).

Signed-off-by: shyeh25 <206795756+shyeh25@users.noreply.github.com>
shyeh25 added a commit to shyeh25/vllm that referenced this pull request Jan 9, 2026
…-project#30842)"

This reverts commit a100152.

This PR causes a functional issue for GPT-OSS-120B TP8 (NotImplementedError: FlashInfer backend currently does not support attention sinks).

Signed-off-by: shyeh25 <206795756+shyeh25@users.noreply.github.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…#30842)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

nvidia ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

6 participants